Database of Etymological Roots Beginning in PIE: A Pilot Study
Dr. Anton Vinogradov, (recent!) PhD, Computer Science
University of Kentucky
2024-10-25
In DERBi PIE, having an integrated semantic system could provide automated answers to questions such as:
Are certain sound sequences associated with certain meanings or semantic spheres?
What about morphological classes or derivations?
How have meanings changed over time across the various branches and daughter languages?
Distributional Analysis: “You shall know a word by the company it keeps.” - John Rupert Firth
For example, you are very likely to see the words “dog” and “leash” appear near each other in a text.
However, you are less likely to see the words “dog” and “physics” near each other.
There exist programming tools that allow us to generate vectors (essentially coordinates) based on a word’s position within a text and proximity to other words.
All of the vectors that can be generated from a text would lie in a semantic space or hyperspace.
This approach is in line with what is done in present-day NLP – identifying semantic relationships through word embeddings.
Before we can run a text through one of these tools, we must first discuss tokenization and lemmatization.
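A minimal sketch of those two preprocessing steps: tokenization splits raw text into word tokens, and lemmatization maps each token to its dictionary form. The lemma table here is a toy stand-in; real pipelines use a full morphological lexicon (e.g. via spaCy or NLTK).

```python
import re

# Toy lemma table for illustration only; a real pipeline would use
# a morphological analyzer rather than a hand-written dict.
LEMMAS = {"dogs": "dog", "walked": "walk", "leashes": "leash"}

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(tokens):
    """Map each token to its dictionary form (lemma)."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = tokenize("The dogs walked on their leashes.")
print(lemmatize(tokens))  # ['the', 'dog', 'walk', 'on', 'their', 'leash']
```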
You can tinker with how exactly a word-embedding tool generates these vectors, for example:
How many dimensions will it generate for each vector?
How far to each side of a word will it look?
How many times will it run through the text?
Plots like these can be created with generated vectors!
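Such 2-D plots require projecting the high-dimensional vectors down first, typically with PCA. A minimal numpy sketch, assuming a made-up matrix of 50-dimensional word vectors; the resulting two columns are the x/y coordinates for a scatter plot:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 50))  # stand-in: 10 words, 50-dim embeddings

# PCA via SVD: center the data, decompose, keep the top two components.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T  # (10, 2): one (x, y) point per word

print(coords_2d.shape)  # (10, 2)
```

From here, `coords_2d[:, 0]` and `coords_2d[:, 1]` can be passed to e.g. matplotlib's `scatter`.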
Fancy formula from Wikipedia
If there isn’t a 1:1 correspondence, identify the centroid of the lemma’s vectors, giving a language-word center (lwc): French frêle, fragile : Latin fragilis
Identify the centroid of both lwcs -> the inter-language-word center
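The centroid steps above can be sketched directly: average a lemma's vectors within one language to get its language-word center (lwc), then average the lwcs across languages for the inter-language-word center. The vectors here are illustrative stand-ins, not real embeddings:

```python
import numpy as np

def centroid(vectors):
    """Mean of a stack of vectors: their 'center' point in the space."""
    return np.mean(vectors, axis=0)

# Stand-in embeddings (in practice, drawn from each language's hyperspace).
french_frele   = np.array([1.0, 0.0, 2.0])
french_fragile = np.array([3.0, 2.0, 0.0])
latin_fragilis = np.array([2.0, 2.0, 2.0])

# French has two reflexes, so take their centroid first: the lwc.
french_lwc = centroid([french_frele, french_fragile])  # [2., 1., 1.]
latin_lwc = latin_fragilis  # 1:1 correspondence, so no averaging needed

# Centroid of both lwcs -> inter-language-word center.
ilwc = centroid([french_lwc, latin_lwc])
print(ilwc)  # [2.  1.5 1.5]
```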
One of these things is not like the other!
We believe that our reconstruction model shows promise
We plan to stick with the Descendant Model strategy, but:
Utilize LLMs (such as GPT) to model the hyperspace, for greater precision and differentiation of polysemy
Instead of Google Translate, use a bilingual dictionary or an LLM
Add additional Romance languages;
When happy with results, move on to other subbranches (likely Slavic or Indic) for testing;
We believe that a good, workable model should be able to generate a *proto-vector at the centroid of the descendant language-word centers (minus those additional steps).
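Under that model, the proto-vector is just one more centroid, this time taken over the descendant language-word centers. A sketch with made-up lwcs for three hypothetical branches:

```python
import numpy as np

# Hypothetical language-word centers for one root in three descendant
# branches (made-up numbers, for illustration only).
descendant_lwcs = np.array([
    [0.9, 0.1, 0.4],  # e.g. a Romance lwc
    [1.1, 0.3, 0.2],  # e.g. a Slavic lwc
    [1.0, 0.2, 0.6],  # e.g. an Indic lwc
])

# *proto-vector: centroid of the descendant language-word centers.
proto_vector = descendant_lwcs.mean(axis=0)
print(proto_vector)  # roughly [1.0, 0.2, 0.4]
```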
Let’s finish with where we began – DERBi PIE
With aligned hyperspaces generated for each descendant and reconstructed language, our models would give us coordinates for each lexeme in a language
Possible question: “How semantically similar are PIE *sC- roots – as compared to related *C- roots?” (*(s)peḱ- ‘look at’)
Example from English:
bl- words: “blue”, “blaze”, “bland”, “blush”, “blink”, “blow”, “blast”, “blot”, “blend”, “bleak”
These words have a cosine similarity score of 0.8016 (remember: a perfect score is 1.0!)
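A score like that can be computed by averaging cosine similarity over all pairs of word vectors. A sketch of that computation with made-up three-dimensional vectors for a few of the bl- words (the 0.8016 figure itself comes from real trained embeddings, not from these numbers):

```python
import numpy as np
from itertools import combinations

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); 1.0 = identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings for illustration; real ones come from the trained model.
vectors = {
    "blue":  np.array([1.0, 0.2, 0.1]),
    "blaze": np.array([0.9, 0.4, 0.0]),
    "blush": np.array([0.8, 0.3, 0.3]),
}

# Average cosine similarity over all unordered pairs of words.
pairs = list(combinations(vectors.values(), 2))
avg = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
print(round(avg, 4))
```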